BACKGROUND AND PURPOSES


In 195[?] Personnel Research Branch 1/ research scientists, in coordination
with the Army Language School, undertook to revise the Army Language Test
(ALPT), which had been in use since 1949. These tests were designed to
measure the reading comprehension, aural comprehension, and writing ability
of Army personnel assigned to linguistic jobs. The first step was the
construction and field validation of revised language proficiency tests for
two representative languages -- Chinese-Mandarin and Russian. The final
forms of these two tests were used as the prototypes for construction of 33
subsequent tests in other languages. It was assumed that, by using this
prototype approach, the validity of the subsequent tests would be comparable
to that of the two prototypes and that cutting scores on the subsequent
tests could be generalized from those established for the two prototypes.
The primary objective of this study was to investigate the comparability of
validity coefficients and cutting scores of Army Language Proficiency Tests
for three additional languages -- French, German, and Polish -- not
previously analyzed.


PROCEDURES 


EXPERIMENTAL  TESTS 

Three experimental tests, described below, were validated on Army
personnel assigned to France and Germany.

1. Army Language Proficiency Test - French, DA PT 3^39. This test is
presented in two parts: Part I, Listening Comprehension, and Part II,
Reading Comprehension. Each part contains 60 items and yields a maximum
score of 60.

2. Army Language Proficiency Test - German, DA PT 3^2, and

3. Army Language Proficiency Test - Polish, DA PT 3^72, are identical to
the French test with respect to format and item types.


1/ On 3 December 1960, the organization was designated Personnel Research
Branch, The Adjutant General's Office. Effective 1 January 1961 its
title was changed to the present R and D Command facility.




In an earlier study (Dunn, T. F., et al., May 1957), Part I was found to
relate highly with performance on Army-type tasks involving conversational
skills; a similar degree of relationship was established between Part II
and performance on tasks involving reading and writing skills.




CRITERION  MEASURES 

Two expert linguists with fluency in both the appropriate foreign
language (French, German, or Polish) and English evaluated each examinee's
performance in simulated translator and interpreter work samples. The
Translator Work Samples (TWS) were administered and evaluated before the
Interpreter Work Sample (IWS). The work samples included:

Translator Work Samples - French, DA PT 3638; German, DA PT 3640;
Polish, DA PT 3642

Interpreter Work Samples - French, DA PT 3639; German, DA PT 3641;
Polish, DA PT 3643

The Translator Work Sample required an examinee to translate (write down)
typed statements from the foreign language into English and vice versa.
Fifteen statements were presented for each performance type. The
translated statements were then evaluated by two experts. The total score
was the sum of the evaluations of each statement. In most cases the same
subject matter experts evaluated both translator and interpreter work
sample performance. On the basis of findings in a previous study (Dunn,
T. F., et al., May 1957), it was believed that work sample evaluations made
by the same rater do not differ appreciably from those made by different
raters.

In the Interpreter Work Sample the examinee was required to perform as
an interpreter between an English-speaking interrogator and a
French-speaking informant. The examinee was required to listen to 30
statements or questions, 15 of which were read by the interrogator in
English and restated by the examinee in the appropriate language. Then 15
statements were given in the foreign language and had to be restated in
English. The statements and questions used in the script made up an
integrated conversation. Structure, format, and administrative procedures
were identical for each of the three languages. The total score was the
sum of the evaluations made by each of the language experts. An overall
rating of the examinee's skill as an interpreter was prepared in each work
sample on an 11-point rating scale covering five major skill levels of
usefulness as an interpreter in terms of ability to communicate accurately
and completely. An example of the scale is shown in Figure 1.

INTERPRETER RATING SKILLS

HOW USEFUL IS THIS MAN FOR AN INTERPRETER JOB IN TERMS OF HIS
ABILITY TO COMMUNICATE?

He could not handle any interpreter assignment.

He should be used as an interpreter ONLY in an emergency.

He could be used on most lower level interpreter assignments.

He could be used on most interpreter assignments except those
requiring highly literate native fluency or its equivalent.

He would be successful at handling any kind of interpreter
assignments.

Figure 1. Rating Scale - Interpreter Work Sample


Studies of Army personnel who have been tested for their foreign
language proficiency have shown a high relationship between reading
comprehension skills and conversational usage skills. For example, in one
study of the English Fluency Battery (Robinson, J. E., et al., April 1957)
intercorrelation coefficients between speaking, understanding, and reading
ability in English of 597 Puerto Ricans ranged from .50 to .75. In a study
of the ALPT prototypes (Dunn, T. F., et al., May 1957) the range of
intercorrelation coefficients between speaking-understanding type criteria
(Interpreter-Audiomonitor Work Samples) and reading-writing type criteria
(Translator Work Samples) was .59 to .78.


SAMPLES 


The experimental tests were administered during May and June 1958 to
334 military personnel on duty in France and Germany. Of these, 116
examinees were given the ALPT-French; 115, the ALPT-German; and 103, the
ALPT-Polish. Slightly less than one-half of the men were serving in jobs
requiring proficiency in language skills at the time. Within this limited
linguist segment (considered typical for Army linguist populations) a
relatively broad range of language ability was evidenced by performance on
the work samples. Table 1 shows score ranges, means, and standard
deviations for each of the work samples for all three tests for this
segment.


Table 1

SCORE RANGE, MEANS, AND STANDARD DEVIATIONS
ON LANGUAGE WORK SAMPLES FOR EXAMINEES ASSIGNED
TO LINGUISTIC JOBS

                              Actual
Language    Work Sample       Score Range     Mean       S.D.

French      Translator        [?] (a)         16.31       9.56
            Interpreter       0 - 60 (b)      39.[?]3    18.30

German      Translator        [?]             16.52       7.34
            Interpreter       0 - 60          40.15      11.55

Polish      Translator        [?]             12.39       7.27
            Interpreter       0 - 60          38.02      14.55

(a) The possible range for the Translator Work Sample for all three
languages was 0 - 30.

(b) The possible range for the Interpreter Work Sample for all three
languages was 0 - 60.


ESTABLISHING  CUTTING  SCORES 


To establish required levels of proficiency for the Army's needs, it
was necessary to determine qualifying scores. Subject matter experts
completed an item-by-item evaluation of an examinee's performance on the
Interpreter Work Sample and also a rating of his overall performance. The
rating scale contained "built-in" cutting points defined in terms of
"Unsatisfactory", "Poor", "Fair", and "Good". Using the equal percentile
method, criterion cutting points could then be related to cutting scores
on the language proficiency tests. Comparable ratings were not available
for the translator work samples. However, on the basis of the range of
correlation coefficients between interpreter and translator work samples
(.66 - .79), it was considered operationally feasible to use the
Interpreter Work Sample cutting points for the Translator Work Sample in
determining cutting scores.

Operationally, both numerical and adjectival scores are recorded on
Form 20, Soldier's Qualifying Record, for performance on the Army Language
Proficiency Tests. The adjectival descriptions are "Good", "Fair", and
"Poor". As has been mentioned, cutting points on the predictor measures
were set by using the same percentile at which cutting points fell on the
criterion measures. For example, it was determined that the bottom of the
"Good" category fell at the 7[?]th percentile on the rating form. This
same percentile was used to set the cutting score for the bottom of the
"Good" category on the predictor. For administrative purposes, it was
desirable to establish a common cutting point at each of the descriptive
levels for all language proficiency tests. In order to achieve one set of
cutting scores on each part of the tests, the mean of all three language
tests at each proficiency level was computed. The average cutting scores
thus derived are given in Table 2.
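
The equal percentile procedure described above can be sketched in code.
This is only an illustration under stated assumptions, not the computation
actually performed in the study: the function names, the ratings, the test
scores, the criterion cut of 9, and the two other per-language cuts (47)
are all hypothetical, and ties are handled in the simplest possible way.

```python
from bisect import bisect_left
from statistics import mean


def percentile_rank(scores, cut):
    """Percent of scores that fall strictly below `cut`."""
    ordered = sorted(scores)
    return 100.0 * bisect_left(ordered, cut) / len(ordered)


def score_at_percentile(scores, pct):
    """Observed score located `pct` percent of the way up the sorted list."""
    ordered = sorted(scores)
    index = min(len(ordered) - 1, round(len(ordered) * pct / 100.0))
    return ordered[index]


# Hypothetical data for one language (ten examinees).
ratings = [2, 3, 4, 5, 5, 6, 7, 8, 9, 10]               # overall IWS ratings
test_scores = [12, 20, 25, 31, 33, 38, 41, 45, 50, 55]  # Part I scores

# Step 1: find the percentile at which the criterion cut for "Good" falls.
good_pct = percentile_rank(ratings, 9)

# Step 2: read off the predictor score at that same percentile.
good_cut = score_at_percentile(test_scores, good_pct)

# Step 3: average the per-language cuts to get one common cutting score.
# The two 47s stand in for the other two languages (hypothetical values).
common_good_cut = mean([good_cut, 47, 47])
```

With these made-up numbers the criterion cut falls at the 80th percentile,
the matching predictor score is 50, and the common cutting score is 48.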

Since it was also administratively desirable to use a common cutting
score for 52 other language tests, it was important to have an indication
of the amount of misclassification which would occur in using the
generalization procedure. For this purpose an individual was considered to
be misclassified if, as a result of using the common cutting score for all
languages, he was placed in a different descriptive category (Good - Fair -
Poor) than he would be if separate cutting scores were set for each
language. Percentages of cases misclassified in each category were computed
for each part of the three tests (Table 2).
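
The misclassification count defined above can be illustrated as follows.
This is a sketch under assumptions, not the study's own tabulation: the
test scores and the language-specific cutting scores are hypothetical, the
"Poor" cut of 25.5 is a placeholder, and the label "Below Poor" for scores
under the lowest cut is an assumption. Only the common "Good" and "Fair"
cuts (48.5 and 37.5) are taken from Table 2.

```python
from collections import Counter

LEVELS = ("Good", "Fair", "Poor")


def classify(score, cuts):
    """Return the highest descriptive level whose cutting score is met."""
    for level in LEVELS:
        if score >= cuts[level]:
            return level
    return "Below Poor"  # assumed label for scores under the lowest cut


def percent_misclassified(scores, own_cuts, common_cuts):
    """Percent of all cases, by language-specific level, that change level
    when the common cutting scores replace the language-specific ones."""
    moved = Counter(
        classify(s, own_cuts)
        for s in scores
        if classify(s, own_cuts) != classify(s, common_cuts)
    )
    return {level: 100.0 * moved[level] / len(scores) for level in LEVELS}


# Hypothetical scores and language-specific cuts for one test part.
scores = [20, 30, 36, 38, 40, 47, 49, 52, 55, 58]
own_cuts = {"Good": 46.5, "Fair": 38.5, "Poor": 25.5}
common_cuts = {"Good": 48.5, "Fair": 37.5, "Poor": 25.5}

result = percent_misclassified(scores, own_cuts, common_cuts)
```

Here two of the ten hypothetical examinees change category: the score of
47 drops from "Good" to "Fair", and the score of 38 rises from "Poor" to
"Fair", giving 10 percent misclassified at each of those two levels.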


RESULTS 


Table 2

MEANS OF THE CUTTING SCORES FOR THREE LANGUAGES AND
PERCENT OF CASES WHICH WERE THUS MISCLASSIFIED AT EACH CATEGORY

                             Mean
                             Cutting      Percent Misclassified
Variable                     Score        French   German   Polish

Listening Comprehension
    Good                     48.5            0        6       10
    Fair                     37.5            1        2       13
    Poor                     [?].5           4       10        3

Reading Comprehension
    Good                     48.5           10        0        9
    Fair                     37.5            9        1        0
    Poor                     [?].5           8        8        5


Interrelationships among the criterion and predictor variables, as well
as means and standard deviations for the French, German, and Polish
Language Proficiency Tests, are summarized in Table 3. In the current
study, intercorrelation coefficients for the Translator Work Sample and the
Interpreter Work Sample are: French .79, German .74, and Polish .66. These
coefficients are consistent in magnitude with those found in other studies
on the ALPT. Kuder-Richardson (formula 20) reliability coefficients for the
two parts of each of the tests were consistently high (.90 to .95). The
specific values are reported in Table 3. For each of the two work samples
in all three languages, Kuder-Richardson (formula 20) and inter-rater
reliability coefficients were rather high (generally in the .90's). These
estimates of criterion reliability are reported in Table 4. Analysis of
Interpreter Work Sample scores in Part I against those for Translator Work
Samples in Part II yielded validity coefficients of .83, .66, and .73,
respectively, for the French, German, and Polish tests.
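
For readers unfamiliar with Kuder-Richardson formula 20, the sketch below
shows the standard computation: the number of items k divided by k - 1,
times one minus the ratio of the summed item variances (p times 1 - p for
each item) to the variance of the total scores. The response matrix is
hypothetical, the function name is an assumption, and the population
variance is used for total scores; none of this reproduces the study's
data.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for rows of 0/1 item scores.

    `responses` holds one row per examinee and one column per item.
    """
    n_people = len(responses)
    n_items = len(responses[0])

    # Variance of examinees' total scores (population form).
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)

    # Sum over items of p * (1 - p), where p is the proportion passing.
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_people
        pq_sum += p * (1.0 - p)

    return (n_items / (n_items - 1)) * (1.0 - pq_sum / total_variance)


# Hypothetical responses: five examinees, four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(responses)
```

For this small made-up matrix the total-score variance is 2, the summed
item variances are 0.80, and the coefficient works out to about .80.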


Table 4

ESTIMATES OF RELIABILITY OF THE CRITERION MEASURES

                               Kuder-Richardson
                               Reliability          Inter-Rater
Variable                       (Formula 21)         Agreement

Translator Work Sample
    French                        .94                  .98
    German                        .89                  .95
    Polish                        .90                  .91

Interpreter Work Sample
    French                        .99                  .96
    German                        .95                  .91
    Polish                        .90                  .95
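
The memorandum does not state how inter-rater agreement was computed. As
one plausible reading, the sketch below treats it as a product-moment
(Pearson) correlation between the two raters' scores and forms each
examinee's total score as the sum of the two raters' evaluations, as the
text describes for the work samples. The rater scores and the function
name are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical work sample scores given by two raters to five examinees.
rater_a = [10, 14, 18, 22, 27]
rater_b = [11, 13, 19, 21, 28]

# Assumed agreement index: correlation between the two raters.
agreement = pearson_r(rater_a, rater_b)

# Total score per examinee: sum of the two raters' evaluations.
total_scores = [a + b for a, b in zip(rater_a, rater_b)]
```

The same function could also serve for validity coefficients, which
correlate a test part with a work sample score.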

SUMMARY AND CONCLUSIONS


Based on the two prototypes, 33 additional language proficiency tests
were constructed by staff members of the Army Language School in
cooperation with research scientists of the Personnel Research Branch, The
Adjutant General's Office. An attempt was made to adhere to prototype item
content and composition to a sufficient extent to insure relative
comparability in validity and difficulty for all of the new tests. The
present Research Memorandum reports the results of the validation and
statistical analysis undertaken for three tests not previously covered --
French, German, and Polish. Validity coefficients obtained were of
sufficient magnitude to indicate that the three tests are highly efficient
measures of language proficiency. Coefficients for these tests (.66 to
.87) closely approximated those for the Chinese-Mandarin and Russian
prototypes (.68 to .86). It was concluded that the tests were fairly
comparable with respect to validity. Levels of fluency were designated
"Good", "Fair", and "Poor", and cutting points were computed at these
levels by use of equal percentiles on the criterion and on the predictor
measures. A common set of cutting scores was established for both parts of
the tests and for all three languages. Amounts of misclassification
estimated to result from this procedure varied from 0% to [?]% of the total
cases. It was also concluded that validity and comparability assumptions
had been met and that common cutting scores could be generalized to all
tests of foreign language proficiency.